Contents
1. Parallel computation: uses OpenMP, and by default all of the machine's cores.
2. Regularization: a major strength of XGBoost is built-in regularization, which guards against overfitting.
3. Cross-validation: XGBoost ships with a built-in cross-validation function.
4. Missing values: XGBoost handles missing values natively, and the model can pick up the trend implied by missingness.
5. Flexibility: supports user-defined objective functions and evaluation metrics.
6. Availability: XGBoost has interfaces for R, Python, Java, Julia, Scala, and other languages.
7. Saving and loading: XGBoost can save and load both data matrices and models.
8. Pruning: XGBoost first grows each tree to the maximum depth, then prunes bottom-up any branch whose loss reduction falls below a threshold.
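A minimal sketch of point 7 on toy data (the file names here are arbitrary; assumes the xgboost package is installed):

```r
library(xgboost)

# Toy data: 100 rows, 4 features, binary label
set.seed(1)
x <- matrix(rnorm(400), ncol = 4)
y <- as.numeric(x[, 1] + rnorm(100) > 0)
dtrain <- xgb.DMatrix(data = x, label = y)

# Save and reload the data matrix
xgb.DMatrix.save(dtrain, "dtrain.buffer")
dtrain2 <- xgb.DMatrix("dtrain.buffer")

# Train, save, and reload the model
bst <- xgb.train(params = list(objective = "binary:logistic"),
                 data = dtrain, nrounds = 5)
xgb.save(bst, "xgb.model")
bst2 <- xgb.load("xgb.model")

# The reloaded model reproduces the original predictions
stopifnot(isTRUE(all.equal(predict(bst, x), predict(bst2, x))))
```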
1. Classification: use booster = gbtree. Each tree is built on top of the previous ones: points that earlier trees mis-classified receive more weight, which lowers the misclassification rate in later rounds.
2. Regression: use booster = gbtree or booster = gblinear. With gblinear, a generalized linear model with (L1, L2) regularization is built and fitted by gradient descent. Each subsequent model fits the residuals of the ones before it.
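The two boosters can be contrasted on a toy regression problem (a sketch; the parameter values are illustrative and untuned, and reg:squarederror is the squared-error objective in recent xgboost versions):

```r
library(xgboost)

# Toy linear data: 100 rows, 5 features
set.seed(1)
x <- matrix(rnorm(500), ncol = 5)
y <- as.numeric(x %*% c(1, 2, 0, 0, -1) + rnorm(100))
dtrain <- xgb.DMatrix(data = x, label = y)

# gbtree: each new tree fits the residual errors of the trees before it
bst_tree <- xgb.train(params = list(booster = "gbtree",
                                    objective = "reg:squarederror",
                                    eta = 0.3, max_depth = 3),
                      data = dtrain, nrounds = 20)

# gblinear: a generalized linear model with L1/L2 regularization,
# updated over successive boosting rounds
bst_lin <- xgb.train(params = list(booster = "gblinear",
                                   objective = "reg:squarederror",
                                   lambda = 1, alpha = 0.1),
                     data = dtrain, nrounds = 20)
```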
XGBoost's parameters can be divided into three classes: general parameters, booster parameters, and task parameters. Among the general parameters:
booster: sets the booster type (gbtree, gblinear, dart). For classification, gbtree and gblinear can be used; for regression, any type works.
nthread: enables parallel computation. There is usually no need to change this parameter, since all cores are used by default, which gives the fastest computation.
verbose: if set to 1, the R console is flooded with run-time messages, so this parameter is best left unchanged.
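The three classes can be seen together in a single params list; a sketch with illustrative default values:

```r
library(xgboost)

params <- list(
  # 1. General parameters
  booster = "gbtree",            # gbtree / gblinear / dart
  # nthread omitted: all cores are used by default
  # 2. Booster parameters
  eta = 0.3,
  max_depth = 6,
  # 3. Task parameters
  objective = "binary:logistic",
  eval_metric = "auc"
)
```

The list is then passed as `params = params` to xgb.train or xgb.cv; verbose is given to those functions directly rather than inside the list.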
Tuning strategy:
1. Choose a relatively high learning rate (eta). An initial value of 0.1 usually works, although for some problems the ideal rate falls anywhere between 0.05 and 0.3. Use the early_stopping_rounds argument of xgb.cv to determine the optimal number of trees (nrounds) at that rate. Then, with the learning rate fixed, tune the tree-specific parameters (max_depth, min_child_weight, gamma, subsample, colsample_bytree).
2. Tune XGBoost's regularization parameters (lambda, alpha). These reduce model complexity and can thereby improve performance.
3. Including scale_pos_weight, run a grid search over all parameters in the neighborhood of the combination found by the greedy step-by-step tuning above.
4. Lower the learning rate and determine the ideal number of trees (nrounds).
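Step 1 can be sketched on toy data: fix eta, set nrounds deliberately high, and let the early_stopping_rounds argument of xgb.cv choose the tree count (best_iteration is the field the xgboost R package reports it in):

```r
library(xgboost)

# Toy binary-classification data: 500 rows, 4 features
set.seed(1)
x <- matrix(rnorm(2000), ncol = 4)
y <- as.numeric(x[, 1] + x[, 2] + rnorm(500) > 0)
dtrain <- xgb.DMatrix(data = x, label = y)

cv <- xgb.cv(params = list(objective = "binary:logistic",
                           eval_metric = "auc", eta = 0.1),
             data = dtrain,
             nrounds = 1000,              # deliberately large upper bound
             nfold = 5,
             early_stopping_rounds = 50,  # stop once test AUC stops improving
             verbose = FALSE)

cv$best_iteration  # use this as nrounds while tuning the tree parameters
```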
library(ROCR)
calc_auc_and_ks <- function(pred, y) {
  pred.obj1 <- ROCR::prediction(pred, y)
  ## AUC
  auc.tmp1 <- performance(pred.obj1, "auc")
  auc1 <- as.numeric(auc.tmp1@y.values)
  ## KS: maximum gap between the TPR and FPR curves
  roc.tmp1 <- performance(pred.obj1, "tpr", "fpr")
  ks <- max(roc.tmp1@y.values[[1]] - roc.tmp1@x.values[[1]])
  # print(c(auc1, ks))
  return(list(auc1, ks))
}
# get cv-auc, cv-ks:
# cv_prediction <- xgb$pred
# calc_auc_and_ks(cv_prediction, y_train)
grid_search <- function(dtrain, y_train,
                        seed = 10, nthread = 20, missing = NA,
                        nrounds = 10000, early_stopping_rounds = 50,
                        nfold = 5, stratified = T, verbose = F, prediction = T,
                        eta = c(0.1),
                        max_depth = c(6),
                        min_child_weight = c(1),
                        gamma = c(0),
                        subsample = c(1),
                        colsample_bytree = c(1),
                        lambda = c(1),
                        alpha = c(0),
                        scale_pos_weight = c(1)) {
  # Output data.frame: one row per parameter combination, with its cv AUC and KS
  output_df <- data.frame(t(rep(NA, 11)))
  names(output_df) <- c("eta", "max_depth", "min_child_weight", "gamma", "subsample",
                        "colsample_bytree", "lambda", "alpha", "scale_pos_weight",
                        "cv_auc", "cv_ks")
  rowkey <- 1
  # Build the parameter grid
  to_tune = expand.grid(eta = eta,
                        max_depth = max_depth,
                        min_child_weight = min_child_weight,
                        gamma = gamma,
                        subsample = subsample,
                        colsample_bytree = colsample_bytree,
                        lambda = lambda,
                        alpha = alpha,
                        scale_pos_weight = scale_pos_weight)
  # Loop over every combination in the grid
  for (i in seq_len(nrow(to_tune))) {
    xgb_params = list(objective = "binary:logistic",
                      eval_metric = "auc")
    xgb_params$eta = to_tune[i, 1]
    xgb_params$max_depth = to_tune[i, 2]
    xgb_params$min_child_weight = to_tune[i, 3]
    xgb_params$gamma = to_tune[i, 4]
    xgb_params$subsample = to_tune[i, 5]
    xgb_params$colsample_bytree = to_tune[i, 6]
    xgb_params$lambda = to_tune[i, 7]
    xgb_params$alpha = to_tune[i, 8]
    xgb_params$scale_pos_weight = to_tune[i, 9]
    set.seed(seed)
    start_tm <- Sys.time()
    xgb = xgb.cv(data = dtrain,
                 params = xgb_params,
                 nthread = nthread,
                 missing = missing,
                 nrounds = nrounds,
                 early_stopping_rounds = early_stopping_rounds,
                 nfold = nfold,
                 stratified = stratified,
                 verbose = verbose,
                 prediction = prediction)
    end_tm <- Sys.time()
    # print(paste0(rowkey, " run time: ", end_tm - start_tm))
    # Out-of-fold cv AUC and KS for this combination
    cv_prediction <- xgb$pred
    list_auc_ks <- calc_auc_and_ks(cv_prediction, y_train)
    auc_i <- list_auc_ks[[1]]
    ks_i <- list_auc_ks[[2]]
    # Fill the output row: the nine parameter values, then cv_auc and cv_ks
    output_df[rowkey, ] <- c(as.numeric(to_tune[i, ]), auc_i, ks_i)
    # print(paste0("rowkey=", rowkey, ": auc=", auc_i, ", ks=", ks_i))
    rowkey <- rowkey + 1
  }
  # Rank AUC and KS in descending order (rank 1 = best combination)
  output_df$desc_rank_auc <- nrow(output_df) + 1 - rank(output_df$cv_auc)
  output_df$desc_rank_ks <- nrow(output_df) + 1 - rank(output_df$cv_ks)
  return(output_df)
}
library(xgboost)
library(dplyr)
df_train = read.csv("data/cs-training.csv", stringsAsFactors = FALSE) %>%
  na.omit() %>%   # drop rows with missing values
  select(-`X`)    # drop the first index column
train_data = as.matrix(df_train %>% select(-SeriousDlqin2yrs))
train_label = df_train$SeriousDlqin2yrs
dtrain <- xgb.DMatrix(data = train_data, label = train_label)
max_depth, min_child_weight, gamma, subsample, colsample_bytree
Tune coarsely over a wide range first, then fine-tune over a narrow one; depending on your machine's performance, you can widen the grid-search range and shrink the step sizes accordingly.
grid_search_result = grid_search(dtrain, train_label,
eta = c(0.1),
max_depth = c(3, 5, 7, 9), # initial value set to [3-9]
min_child_weight = c(1, 3, 5), # initial value set to [1-5]
gamma = c(0), # initial value set to 0
subsample = c(0.8), # typical initial value set to 0.8, can be set to [0.5, 0.9]
colsample_bytree = c(0.8), # typical initial value set to 0.8, can be set to [0.5, 0.9]
lambda = c(1),
alpha = c(0),
scale_pos_weight = c(1))
grid_search_result
grid_search_result %>% filter(desc_rank_auc == 1)
The ideal max_depth is 3 and the ideal min_child_weight is 5, but we have not yet tried max_depth values below 3 or min_child_weight values above 5, so we keep searching around this combination.
grid_search_result = grid_search(dtrain, train_label,
eta = c(0.1),
max_depth = c(1, 3, 5),
min_child_weight = c(3, 5, 7),
gamma = c(0),
subsample = c(0.8),
colsample_bytree = c(0.8),
lambda = c(1),
alpha = c(0),
scale_pos_weight = c(1))
grid_search_result
grid_search_result %>% filter(desc_rank_auc == 1)
The ideal max_depth is still 3 and the ideal min_child_weight still 5. Search further around this combination with a step size of 1 to pin down the ideal pair.
grid_search_result = grid_search(dtrain, train_label,
eta = c(0.1),
max_depth = c(2, 3, 4),
min_child_weight = c(4, 5, 6),
gamma = c(0),
subsample = c(0.8),
colsample_bytree = c(0.8),
lambda = c(1),
alpha = c(0),
scale_pos_weight = c(1))
grid_search_result
grid_search_result %>% filter(desc_rank_auc == 1)
Final choice: the ideal max_depth is 3 and the ideal min_child_weight is 5.
grid_search_result = grid_search(dtrain, train_label,
eta = c(0.1),
max_depth = c(3),
min_child_weight = c(5),
gamma = c(0, 0.01, 0.1, 1, 3, 5, 10, 20),
subsample = c(0.8),
colsample_bytree = c(0.8),
lambda = c(1),
alpha = c(0),
scale_pos_weight = c(1))
grid_search_result
grid_search_result %>% filter(desc_rank_auc == 1)
The ideal gamma is 3; refine around this value with a step size of 1.
grid_search_result = grid_search(dtrain, train_label,
eta = c(0.1),
max_depth = c(3),
min_child_weight = c(5),
gamma = c(2, 3, 4),
subsample = c(0.8),
colsample_bytree = c(0.8),
lambda = c(1),
alpha = c(0),
scale_pos_weight = c(1))
grid_search_result
grid_search_result %>% filter(desc_rank_auc == 1)
Final choice: the ideal gamma is 3.
Grid-search subsample and colsample_bytree over the range 0.6 to 1.0 with a step size of 0.1.
grid_search_result = grid_search(dtrain, train_label,
eta = c(0.1),
max_depth = c(3),
min_child_weight = c(5),
gamma = c(3),
subsample = c(0.6, 0.7, 0.8, 0.9, 1.0),
colsample_bytree = c(0.6, 0.7, 0.8, 0.9, 1.0),
lambda = c(1),
alpha = c(0),
scale_pos_weight = c(1))
grid_search_result
grid_search_result %>% filter(desc_rank_auc == 1)
Final choice: the ideal subsample is 0.7 and the ideal colsample_bytree is 0.8.
Since gamma already provides a more effective way to curb overfitting, alpha and lambda can be tuned relatively coarsely.
grid_search_result = grid_search(dtrain, train_label,
eta = c(0.1),
max_depth = c(3),
min_child_weight = c(5),
gamma = c(3),
subsample = c(0.7),
colsample_bytree = c(0.8),
lambda = c(0, 1e-5, 1e-2, 0.1, 1, 100),
alpha = c(0, 1e-5, 1e-2, 0.1, 1, 100),
scale_pos_weight = c(1))
grid_search_result
grid_search_result %>% filter(desc_rank_auc == 1)
The ideal lambda is 100 and the ideal alpha is 1.
# scale_pos_weight is typically the ratio of negative to positive samples
weight = (length(train_label) - sum(train_label)) / sum(train_label)
grid_search_result = grid_search(dtrain, train_label,
eta = c(0.1),
max_depth = c(2, 3, 4),
min_child_weight = c(4, 5, 6),
gamma = c(2, 3, 4),
subsample = c(0.6, 0.7, 0.8),
colsample_bytree = c(0.7, 0.8, 0.9),
lambda = c(100),
alpha = c(1),
scale_pos_weight = c(1, weight))
grid_search_result
grid_search_result %>% filter(desc_rank_auc == 1)
Final choices: max_depth = 4, min_child_weight = 5, gamma = 3, subsample = 0.6, colsample_bytree = 0.9, lambda = 100, alpha = 1, scale_pos_weight = 1.
Lower the learning rate in factors of 10, searching eta over 0.001, 0.01, and 0.1.
grid_search_result = grid_search(dtrain, train_label,
eta = c(0.001, 0.01, 0.1),
max_depth = c(4),
min_child_weight = c(5),
gamma = c(3),
subsample = c(0.6),
colsample_bytree = c(0.9),
lambda = c(100),
alpha = c(1),
scale_pos_weight = c(1))
grid_search_result
grid_search_result %>% filter(desc_rank_auc == 1)
The ideal eta is 0.1; search further around 0.1 with a step size of 0.05.
grid_search_result = grid_search(dtrain, train_label,
eta = c(0.05, 0.1, 0.15, 0.2, 0.25, 0.3),
max_depth = c(4),
min_child_weight = c(5),
gamma = c(3),
subsample = c(0.6),
colsample_bytree = c(0.9),
lambda = c(100),
alpha = c(1),
scale_pos_weight = c(1))
grid_search_result
grid_search_result %>% filter(desc_rank_auc == 1)
The ideal eta is 0.1; search further around 0.1 with a step size of 0.01.
grid_search_result = grid_search(dtrain, train_label,
eta = c(0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13, 0.14),
max_depth = c(4),
min_child_weight = c(5),
gamma = c(3),
subsample = c(0.6),
colsample_bytree = c(0.9),
lambda = c(100),
alpha = c(1),
scale_pos_weight = c(1))
grid_search_result
grid_search_result %>% filter(desc_rank_auc == 1)
Final choice: the ideal eta is 0.12. Next, find the nrounds corresponding to this eta.
bst_params = list(objective = "binary:logistic",
eval_metric = 'auc',
eta = c(0.12),
max_depth = c(4),
min_child_weight = c(5),
gamma = c(3),
subsample = c(0.6),
colsample_bytree = c(0.9),
lambda = c(100),
alpha = c(1),
scale_pos_weight = c(1))
set.seed(10)
bst_cv = xgb.cv(data = dtrain,
params = bst_params,
# nthread = 20,
missing = NA,
nrounds = 10000,
early_stopping_rounds = 50,
nfold = 5,
stratified = T,
verbose = F,
prediction = T
)
calc_auc_and_ks(bst_cv$pred, train_label)
[[1]]
[1] 0.8566172
[[2]]
[1] 0.5591036
bst_cv$niter
[1] 259
bst_cv$evaluation_log
Final choice: the ideal nrounds is 259.
library(caret)
library(e1071)
xgb_pred <- ifelse(bst_cv$pred > 0.5, 1, 0)
confusionMatrix(factor(xgb_pred), factor(train_label))  # caret expects factor inputs
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 110803 6741
1 1109 1616
Accuracy : 0.9347
95% CI : (0.9333, 0.9361)
No Information Rate : 0.9305
P-Value [Acc > NIR] : 3.372e-09
Kappa : 0.2666
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9901
Specificity : 0.1934
Pos Pred Value : 0.9427
Neg Pred Value : 0.5930
Prevalence : 0.9305
Detection Rate : 0.9213
Detection Prevalence : 0.9773
Balanced Accuracy : 0.5917
'Positive' Class : 0
library(pROC)
modelroc = roc(train_label, bst_cv$pred)
plot(modelroc, print.auc = T, auc.polygon = T, grid = c(0.1, 0.2),
     grid.col = c("green", "red"), max.auc.polygon = T,
     auc.polygon.col = "skyblue", print.thres = T)
library(tidyr)
library(ggplot2)  # needed for ggplot(); dplyr supplies select()
bst_cv$evaluation_log %>%
  select(-contains("std")) %>%
  gather(TestOrTrain, AUC, -iter) %>%
  ggplot(aes(x = iter, y = AUC, group = TestOrTrain, color = TestOrTrain)) +
  geom_line() +
  theme_bw()
Train the model on the full training set with xgb.train, using the tuned parameters.
bst_params = list(objective = "binary:logistic",
eval_metric = 'auc',
eta = c(0.12),
max_depth = c(4),
min_child_weight = c(5),
gamma = c(3),
subsample = c(0.6),
colsample_bytree = c(0.9),
lambda = c(100),
alpha = c(1),
scale_pos_weight = c(1))
bst = xgb.train(data = dtrain,
params = bst_params,
# nthread = 20,
missing = NA,
nrounds = 259,
verbose = F
)
importance <- xgb.importance(feature_names = colnames(train_data), model = bst)
importance
Gain is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on feature X to a branch, there were some wrongly classified elements; after adding the split on this feature, there are two new branches, each of which is more accurate (one branch says: if your observation is on this branch, it should be classified as 1; the other branch says the exact opposite).
Cover measures the relative quantity of observations concerned by a feature.
Frequency is a simpler way to measure the Gain. It just counts the number of times a feature is used in all generated trees. You should not use it (unless you know why you want to use it).
xgb.plot.importance(importance_matrix = importance)
importanceRaw <- xgb.importance(feature_names = colnames(train_data), model = bst, data = train_data, label = train_label)
with=FALSE ignored, it isn't needed when using :=. See ?':=' for examples.
importanceClean <- importanceRaw[,`:=`(Cover=NULL, Frequency=NULL)]
bst_params = list(objective = "binary:logistic",
eval_metric = 'auc',
eta = c(0.12),
max_depth = c(4),
min_child_weight = c(5),
gamma = c(3),
subsample = c(0.6),
colsample_bytree = c(0.9),
lambda = c(100),
alpha = c(1),
scale_pos_weight = c(1))
bst_graph = xgb.train(data = dtrain,
params = bst_params,
# nthread = 20,
missing = NA,
nrounds = 2,
verbose = F
)
library(DiagrammeR)
xgb.plot.tree(model = bst_graph)
Random forests and gradient-boosted decision trees are both ensemble methods: both train many decision trees on the same dataset. The difference is that in a random forest every tree is independent, whereas in gradient boosting each tree corrects the ones before it.
XGBoost can also fit a random forest; below we build one made of 1000 trees.
rf <- xgb.train(data = dtrain, max_depth = 4, num_parallel_tree = 1000,
                subsample = 0.8, colsample_bytree = 0.8, nrounds = 1,
                objective = "binary:logistic", eval_metric = "auc")
rf_pred <- predict(rf, dtrain)
calc_auc_and_ks(rf_pred, train_label)
[[1]]
[1] 0.8483992
[[2]]
[1] 0.5391836
modelroc = roc(train_label, rf_pred)
plot(modelroc, print.auc = T, auc.polygon = T, grid = c(0.1, 0.2),
     grid.col = c("green", "red"), max.auc.polygon = T,
     auc.polygon.col = "skyblue", print.thres = T)
rf_pred <- predict(rf, dtrain)
rf_pred <- if_else(rf_pred > 0.5, 1, 0)
confusionMatrix(factor(rf_pred), factor(train_label))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 111396 7377
1 516 980
Accuracy : 0.9344
95% CI : (0.933, 0.9358)
No Information Rate : 0.9305
P-Value [Acc > NIR] : 5.774e-08
Kappa : 0.1817
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9954
Specificity : 0.1173
Pos Pred Value : 0.9379
Neg Pred Value : 0.6551
Prevalence : 0.9305
Detection Rate : 0.9262
Detection Prevalence : 0.9876
Balanced Accuracy : 0.5563
'Positive' Class : 0
importance <- xgb.importance(feature_names = colnames(train_data), model = rf)
importance
xgb.plot.importance(importance_matrix = importance)